Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Yingxin Zhang, UID: 006140202

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.7    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(ggplot2)
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:   93.047 MiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer:

First, I copy the Parquet format generated in Homework 2 to the current working directory hw3 and renamed it as labevents_pq. Then I import the necessary data files and filter the data for patient 10013310.

# import data and filter data for patient 10013310
patient_id <- 10013310
patients_10013310 <- read_csv("~/mimic/hosp/patients.csv.gz") |> 
  filter(subject_id == patient_id)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_10013310 <- read_csv("~/mimic/hosp/admissions.csv.gz") |> 
  filter(subject_id == patient_id)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transfers_10013310 <- read_csv("~/mimic/hosp/transfers.csv.gz") |> 
  filter(subject_id == patient_id)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labevents_10013310 <- arrow::open_dataset("labevents_pq", format = 'parquet') |>
  select(subject_id, charttime) |> 
  filter(subject_id == patient_id) |> 
  as_tibble()
procedures_icd_10013310 <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |> 
  filter(subject_id == patient_id)
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnoses_icd_10013310 <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |> 
  filter(subject_id == patient_id)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Then, I read the filtered patients_10013310 and admissions_10013310 data and get the patient’s information used in the title of the visualization: Patient 10013310 is F, 70 years old, BLACK/AFRICAN.

# display the filtered data
patients_10013310
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19
admissions_10013310
# A tibble: 3 × 16
  subject_id hadm_id admittime           dischtime           deathtime
       <dbl>   <dbl> <dttm>              <dttm>              <dttm>   
1   10013310  2.12e7 2153-05-26 14:18:00 2153-06-05 19:30:00 NA       
2   10013310  2.21e7 2153-06-10 11:55:00 2153-07-21 18:00:00 NA       
3   10013310  2.77e7 2153-05-06 18:03:00 2153-05-13 13:45:00 NA       
# ℹ 11 more variables: admission_type <chr>, admit_provider_id <chr>,
#   admission_location <chr>, discharge_location <chr>, insurance <chr>,
#   language <chr>, marital_status <chr>, race <chr>, edregtime <dttm>,
#   edouttime <dttm>, hospital_expire_flag <dbl>

Then I add a column unit to transfers_10013310 to indicate whether the care unit is an ICU/CCU. I also remove missing values in transfers_10013310 and labevents_10013310.

# add col unit to transfers, if careunit contains "ICU" or "CCU", then unit is 10, otherwise 8
transfers_10013310 <- transfers_10013310 |>
  mutate(unit = ifelse(str_detect(careunit, "ICU|CCU"), 10, 8))

# remove missing values in transfers_10013310
transfers_10013310 <- transfers_10013310 |> drop_na()

# remove missing values in labevents_10013310
labevents_10013310 <- labevents_10013310 |> drop_na()

Then I merge procedures_icd_10013310 and d_icd_procedures and insert a column Procedure to procedures_icd_10013310 to indicate the type of procedure (the content before the first comma in long_title). I also change the format of chartdate to POSIXct.

# merge procedures_icd_10013310 and d_icd_procedures
procedures_10013310 <- procedures_icd_10013310 |>
  left_join(d_icd_procedures, by = "icd_code")

# insert column Procedure = the content before the first comma in long_title, if long_title contains a comma
procedures_10013310$Procedure <- sub(",.*", "", procedures_10013310$long_title)

# change the format of chartdate to POSIXct
procedures_10013310$chartdate <- as.POSIXct(procedures_10013310$chartdate)

procedures_10013310
# A tibble: 9 × 9
  subject_id  hadm_id seq_num chartdate           icd_code icd_version.x
       <dbl>    <dbl>   <dbl> <dttm>              <chr>            <dbl>
1   10013310 21243435       1 2153-05-27 00:00:00 4A023N7             10
2   10013310 21243435       2 2153-05-27 00:00:00 B2111ZZ             10
3   10013310 21243435       3 2153-05-27 00:00:00 B241ZZ3             10
4   10013310 22098926       1 2153-06-10 00:00:00 03CG3ZZ             10
5   10013310 22098926       2 2153-06-10 00:00:00 3E05317             10
6   10013310 22098926       3 2153-07-15 00:00:00 0DH63UZ             10
7   10013310 22098926       4 2153-06-11 00:00:00 3E0G76Z             10
8   10013310 27682188       1 2153-05-06 00:00:00 027034Z             10
9   10013310 27682188       2 2153-05-06 00:00:00 B211YZZ             10
# ℹ 3 more variables: icd_version.y <dbl>, long_title <chr>, Procedure <chr>

Then I merge diagnoses_icd_10013310, transfers_10013310 and d_icd_diagnoses and pull the top 3 diagnoses. I also draw the ADT history for patient 10013310.

# merge diagnoses_icd_10013310 and d_icd_diagnoses
diagnoses_10013310 <- diagnoses_icd_10013310 |>
  left_join(d_icd_diagnoses, by = "icd_code")

# change the colname "long_title" to "long_title_diagnoses"
colnames(diagnoses_10013310)[colnames(diagnoses_10013310) == "long_title"] <- "long_diagnoses"

# pull top 3 diagnoses
top3_diagnoses <- diagnoses_10013310 |>
  count(long_diagnoses, sort = TRUE) |>
  slice(1:3) |>
  pull(long_diagnoses)

top3_diagnoses
[1] "Acute on chronic systolic (congestive) heart failure"
[2] "Hyperlipidemia, unspecified"                         
[3] "Long term (current) use of insulin"                  
# draw the ADT history for patient 10013310
plot <- ggplot() +
  geom_segment(data = transfers_10013310, aes(x = intime, xend = outtime, 
    y = 3, yend = 3, color = careunit, size = unit)) +
  geom_point(data = labevents_10013310, 
             aes(x = charttime, y = 2), shape = 3, size = 5) +
  geom_point(data = procedures_10013310, 
             aes(x = chartdate, y = 1, shape = Procedure), size = 5) +
  labs(title = "Patient 10013310, F, 70 years old, BLACK/AFRICAN", 
       subtitle = paste(top3_diagnoses, collapse = "\n")) +
  xlab("Calendar Time") +
  theme(axis.title.y = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "gray", linewidth = 0.2),
        panel.border = element_rect(fill = NA, linewidth = 0.5)) +  
  theme(axis.title = element_text(size = 12), 
        legend.text = element_text(size = 8),
        legend.position = "bottom",  # Placing legend at the bottom
        legend.box = "vertical") +
  scale_y_continuous(breaks = 1:3, labels = c("Procedure", "Lab", "ADT")) +
  coord_cartesian(ylim = c(1, 3)) +
  guides(size = FALSE) +
  guides(shape = guide_legend(nrow = 5)) +
  guides(color = guide_legend(title = "Care Unit"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ggsave("10013310_adt.png", plot, width = 10, height = 6)
Warning: The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 9. Consider
specifying shapes manually if you must have them.
Warning: Removed 3 rows containing missing values (`geom_point()`).

The visualization for the patient with subject_id 10013310 is shown below.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer: Here I use the Parquet format chartevents_pq generated in Homework 2 and draw the visualization for the patient with subject_id 10013310.

# filter chartevents for patient 10013310
chartevents_10013310 <- arrow::open_dataset("chartevents_pq", format = 'parquet') |> 
  select(subject_id, stay_id, charttime, itemid, valuenum) |> 
  filter(subject_id == 10013310) |>
  as_tibble() |>
  print(width = Inf)
# A tibble: 7,338 × 5
   subject_id  stay_id charttime           itemid valuenum
        <int>    <int> <dttm>               <int>    <dbl>
 1   10013310 32769810 2153-06-11 01:00:00 223761     98.8
 2   10013310 32769810 2153-06-11 01:00:00 224642     NA  
 3   10013310 32769810 2153-06-11 01:00:00 220046    130  
 4   10013310 32769810 2153-06-11 01:00:00 220047     50  
 5   10013310 32769810 2153-06-11 01:00:00 223751    160  
 6   10013310 32769810 2153-06-11 01:00:00 223752     90  
 7   10013310 32769810 2153-06-11 01:00:00 223769    100  
 8   10013310 32769810 2153-06-11 01:00:00 223770     92  
 9   10013310 32769810 2153-06-11 01:00:00 224161     35  
10   10013310 32769810 2153-06-11 01:00:00 224162      8  
# ℹ 7,328 more rows
# import d_items, filter d_items and for the items of interest: abbreviation = HR, NBPd, NBPs, Temp, RR
d_items <- read_csv("~/mimic/icu/d_items.csv.gz") |> 
  filter(abbreviation %in% c("HR", "NBPd", "NBPs", "Temperature F", "RR")) |> 
  select(itemid, abbreviation) |> 
  print(width = Inf)
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 2
  itemid abbreviation 
   <dbl> <chr>        
1 220045 HR           
2 220179 NBPs         
3 220180 NBPd         
4 220210 RR           
5 223761 Temperature F
# filter chartevents_10013310 for the items of interest
events_10013310 <- chartevents_10013310 |> 
  filter(itemid %in% d_items$itemid) |> 
  print(width = Inf)
# A tibble: 549 × 5
   subject_id  stay_id charttime           itemid valuenum
        <int>    <int> <dttm>               <int>    <dbl>
 1   10013310 32769810 2153-06-11 01:00:00 223761     98.8
 2   10013310 32769810 2153-06-11 02:00:00 220045    113  
 3   10013310 32769810 2153-06-11 02:00:00 220210     26  
 4   10013310 32769810 2153-06-11 02:02:00 220179    131  
 5   10013310 32769810 2153-06-11 02:02:00 220180     62  
 6   10013310 32769810 2153-06-12 00:00:00 220045    121  
 7   10013310 32769810 2153-06-12 00:00:00 220210     25  
 8   10013310 32769810 2153-06-12 00:03:00 220179    134  
 9   10013310 32769810 2153-06-12 00:03:00 220180     70  
10   10013310 32769810 2153-06-12 01:00:00 223761     99  
# ℹ 539 more rows
# merge chartevents_10013310 and d_items
chartevents_d_10013310 <- events_10013310 |>
  left_join(d_items, by = "itemid") |>
  print(width = Inf)
# A tibble: 549 × 6
   subject_id  stay_id charttime           itemid valuenum abbreviation 
        <int>    <int> <dttm>               <dbl>    <dbl> <chr>        
 1   10013310 32769810 2153-06-11 01:00:00 223761     98.8 Temperature F
 2   10013310 32769810 2153-06-11 02:00:00 220045    113   HR           
 3   10013310 32769810 2153-06-11 02:00:00 220210     26   RR           
 4   10013310 32769810 2153-06-11 02:02:00 220179    131   NBPs         
 5   10013310 32769810 2153-06-11 02:02:00 220180     62   NBPd         
 6   10013310 32769810 2153-06-12 00:00:00 220045    121   HR           
 7   10013310 32769810 2153-06-12 00:00:00 220210     25   RR           
 8   10013310 32769810 2153-06-12 00:03:00 220179    134   NBPs         
 9   10013310 32769810 2153-06-12 00:03:00 220180     70   NBPd         
10   10013310 32769810 2153-06-12 01:00:00 223761     99   Temperature F
# ℹ 539 more rows
# draw the ICU vitals for patient 10013310 and put different vitals in different facets
plot <- ggplot(chartevents_d_10013310, aes(x = charttime, y = valuenum, color = abbreviation)) +
  geom_line() +
  geom_point() +
  facet_grid(abbreviation ~ stay_id, scales = "free") +
  labs(title = "Patient 10013310 ICU stays - Vitals") +
  theme(axis.title.y = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(color = "gray", linewidth = 0.5),
        panel.border = element_rect(color = "grey", fill = NA, linewidth = 0.8),
        strip.text = element_text(size = 10)) +  # Adjust the size of facet titles
  scale_color_brewer(palette = "Set1") +
  theme(axis.title.x = element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(color = FALSE)

ggsave("10013310_icu.png", plot, width = 12, height = 6)

The visualization for the patient with subject_id 10013310 is shown below.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

# import icustays data
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer: There are 50920 unique subject_id and 12488 subject_id have multiple ICU stays. The number of ICU stays per subject_id is summarized by the following graph.

# number of unique subject_id
icustays_tble |> 
  distinct(subject_id) |> 
  nrow()
[1] 50920
# whether a subject_id can have multiple ICU stays
icustays_tble |> 
  count(subject_id) |> 
  filter(n > 1) |> 
  nrow()
[1] 12448
# number of ICU stays per subject_id
icustays_tble |> 
  group_by(subject_id) |> 
  summarise(n = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = n)) + 
  labs(title = "Distribution of the Number of ICU Stays per patient", 
       x = "# ICU stays of a patient", y = "Count") +
  theme_minimal()

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

# import admissions data
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer: The distribution of the number of admissions per patient, admission hour, admission minute, and length of hospital stay is summarized by the following graphics.

  • number of admissions per patient
# number of admissions per patient
admissions_tble |> 
  group_by(subject_id) |>
  summarise(n = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = n)) +
  labs(title = "Distribution of the Number of Admissions per patient", 
       x = "# Admissions of a patient", y = "Count") +
  theme_minimal()

  • admission hour (anything unusual?)
# admission hour
admissions_tble |>
  mutate(admission_hour = hour(admittime)) |>
  count(admission_hour) |>
  ggplot(aes(x = admission_hour, y = n)) +
  geom_col() +
  labs(title = "Distribution of Admission Hour", 
       x = "Admission Hour", 
       y = "Frequency") +
  theme_minimal()

Answer: In the distribution of admission hour, 0 am and 7 am are unusual because the number of admissions at these hours is much higher than other hours near them.

  • admission minute (anything unusual?)
# admission minute
admissions_tble |>
  mutate(admission_minute = minute(admittime)) |>
  count(admission_minute) |>
  ggplot(aes(x = admission_minute, y = n)) +
  geom_col() +
  labs(title = "Distribution of Admission Minute", 
       x = "Admission minute", 
       y = "Frequency") +
  theme_minimal()

Answer: In the distribution of admission minute, minute 0, 15, 30, 45 are unusual because the number of admissions at these minute is much higher than other minutes near them.

  • length of hospital stay (from admission to discharge) (anything unusual?)
# length of hospital stay
admissions_tble |> 
  mutate(length_of_stay = as.numeric(difftime(dischtime, admittime, units = "days"))) |> 
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of length of hospital stay", 
       x = "Length of hospital stay (days)", y = "Count") +
  theme_minimal()

Answer: The distribution of length of hospital stay is severely right-skewed. The majority of patients stay in the hospital for less than 30 days and most of them stay for less than 10 days. There are also some patients who stay in the hospital for nearly 300 days.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

# import patients data
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") |> 
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer: The distribution of gender and anchor_age is summarized by the following boxplot and barplot.

  • From the boxplot, we can see that the distribution of anchor_age is similar within each gender group. The median anchor age of women is slightly lower than that of men.

  • From the barplot, The distribution of anchor_age is right-skewed. In many age levels, the number of men is higher than the number of women. It is worth noting that both the number of women and the number of men are very high at the maximum age of 91, which is unusual.

# draw the boxplot
ggplot(data = patients_tble, mapping = aes(x = gender, y = anchor_age)) + 
  geom_boxplot() +
  labs(title = "Distributions of anchor age by gender") +
  labs(x = "Gender", y = "Anchor age (years)")

# draw the barplot
ggplot(data = patients_tble) + 
  geom_bar(mapping = aes(x = anchor_age, fill = gender)) +
  labs(title = "Distributions of anchor age") +
  labs(x = "Anchor age (years)", fill = "Gender")

max(patients_tble$anchor_age)
[1] 91

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer:

# import data
labevents_filter <- arrow::open_dataset("labevents_pq", format = 'parquet') |> 
  select(subject_id, itemid, storetime, charttime, valuenum) |>
  filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) |>
  as_tibble()
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz") |>
  filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) |>
  print(width = Inf)
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 8 × 4
  itemid label             fluid category  
   <dbl> <chr>             <chr> <chr>     
1  50882 Bicarbonate       Blood Chemistry 
2  50902 Chloride          Blood Chemistry 
3  50912 Creatinine        Blood Chemistry 
4  50931 Glucose           Blood Chemistry 
5  50971 Potassium         Blood Chemistry 
6  50983 Sodium            Blood Chemistry 
7  51221 Hematocrit        Blood Hematology
8  51301 White Blood Cells Blood Hematology
# filter labevents s.t. only the lab measurements for the patients in icustays_tble are retained and the last available measurement (by `storetime`) before the ICU stay is retained
labevents_icu <- labevents_filter |>
  filter(subject_id %in% icustays_tble$subject_id) |>
  left_join(
    select(icustays_tble, subject_id, stay_id, intime), 
    by = "subject_id") |>
  group_by(subject_id, stay_id, itemid) |>
  filter(storetime < intime) |>
  filter(storetime == max(storetime)) |>
  ungroup() |>
  collect() |>
  as_tibble() |>
  print(width = Inf)
Warning in left_join(filter(labevents_filter, subject_id %in% icustays_tble$subject_id), : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 24 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 525,243 × 7
   subject_id itemid storetime           charttime           valuenum  stay_id
        <dbl>  <int> <dttm>              <dttm>                 <dbl>    <dbl>
 1   10003400  51221 2137-02-25 13:58:00 2137-02-25 13:27:00     22.1 32128372
 2   10003400  51301 2137-02-25 13:59:00 2137-02-25 13:27:00      4.3 32128372
 3   10003400  50882 2137-02-25 14:57:00 2137-02-25 13:27:00     15   32128372
 4   10003400  50902 2137-02-25 14:57:00 2137-02-25 13:27:00    112   32128372
 5   10003400  50912 2137-02-25 14:57:00 2137-02-25 13:27:00      0.6 32128372
 6   10003400  50931 2137-02-25 14:57:00 2137-02-25 13:27:00    137   32128372
 7   10003400  50971 2137-02-25 14:57:00 2137-02-25 13:27:00      4.4 32128372
 8   10003400  50983 2137-02-25 14:57:00 2137-02-25 13:27:00    139   32128372
 9   10003400  51221 2137-08-10 00:48:00 2137-08-09 22:34:00     25.7 34577403
10   10003400  51301 2137-08-10 00:48:00 2137-08-09 22:34:00      6.6 34577403
   intime             
   <dttm>             
 1 2137-02-25 23:37:19
 2 2137-02-25 23:37:19
 3 2137-02-25 23:37:19
 4 2137-02-25 23:37:19
 5 2137-02-25 23:37:19
 6 2137-02-25 23:37:19
 7 2137-02-25 23:37:19
 8 2137-02-25 23:37:19
 9 2137-08-10 19:54:51
10 2137-08-10 19:54:51
# ℹ 525,233 more rows
# restrict to the last available measurement (by `storetime`) before the ICU stay.
labevents_item <- labevents_icu |>
  left_join(
    select(d_labitems, itemid, label), 
    by = "itemid") |>
  print(width = Inf)
# A tibble: 525,243 × 8
   subject_id itemid storetime           charttime           valuenum  stay_id
        <dbl>  <dbl> <dttm>              <dttm>                 <dbl>    <dbl>
 1   10003400  51221 2137-02-25 13:58:00 2137-02-25 13:27:00     22.1 32128372
 2   10003400  51301 2137-02-25 13:59:00 2137-02-25 13:27:00      4.3 32128372
 3   10003400  50882 2137-02-25 14:57:00 2137-02-25 13:27:00     15   32128372
 4   10003400  50902 2137-02-25 14:57:00 2137-02-25 13:27:00    112   32128372
 5   10003400  50912 2137-02-25 14:57:00 2137-02-25 13:27:00      0.6 32128372
 6   10003400  50931 2137-02-25 14:57:00 2137-02-25 13:27:00    137   32128372
 7   10003400  50971 2137-02-25 14:57:00 2137-02-25 13:27:00      4.4 32128372
 8   10003400  50983 2137-02-25 14:57:00 2137-02-25 13:27:00    139   32128372
 9   10003400  51221 2137-08-10 00:48:00 2137-08-09 22:34:00     25.7 34577403
10   10003400  51301 2137-08-10 00:48:00 2137-08-09 22:34:00      6.6 34577403
   intime              label            
   <dttm>              <chr>            
 1 2137-02-25 23:37:19 Hematocrit       
 2 2137-02-25 23:37:19 White Blood Cells
 3 2137-02-25 23:37:19 Bicarbonate      
 4 2137-02-25 23:37:19 Chloride         
 5 2137-02-25 23:37:19 Creatinine       
 6 2137-02-25 23:37:19 Glucose          
 7 2137-02-25 23:37:19 Potassium        
 8 2137-02-25 23:37:19 Sodium           
 9 2137-08-10 19:54:51 Hematocrit       
10 2137-08-10 19:54:51 White Blood Cells
# ℹ 525,233 more rows
# The final `labevents_tble` should have one row per ICU stay and columns for each lab measurement so we convert the table from long to wide. If multiple values of the same item with the same storetime, use the one with the latest charttime. If multiple values of the same item with the same storetime and charttime, use the last valuenum.
labevents_tble <- labevents_item |>
  group_by(subject_id, stay_id, storetime, label) |> 
  filter(charttime == max(charttime)) |>
  ungroup() |>
  group_by(subject_id, stay_id, label, charttime) |>
  # retain the last row
  slice_tail() |>
  ungroup() |>
  select(subject_id, stay_id, label, valuenum) |>
  pivot_wider(names_from = label, values_from = valuenum) |>
  print(width = Inf)
# A tibble: 68,467 × 10
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Hematocrit
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>      <dbl>
 1   10000032 39553978          25       95        0.7     102       41.1
 2   10000980 39765666          21      109        2.3      89       27.3
 3   10001217 34592300          30      104        0.5      87       37.4
 4   10001217 37067082          22      108        0.6     112       38.1
 5   10001725 31205490          NA       98       NA        NA       NA  
 6   10001884 37510196          30       88        1.1     141       39.7
 7   10002013 39060235          24      102        0.9     288       34.9
 8   10002155 31090461          23       98        2.8     117       25.5
 9   10002155 32358465          26       85        1.4     133       22.4
10   10002155 33685454          24      105        1.1     138       39.7
   Potassium Sodium `White Blood Cells`
       <dbl>  <dbl>               <dbl>
 1       6.7    126                 6.9
 2       3.9    144                 5.3
 3       4.1    142                 5.4
 4       4.2    142                15.7
 5       4.1    139                NA  
 6       4.5    130                12.2
 7       3.5    137                 7.2
 8       4.9    135                17.9
 9       5.7    120                 9.8
10       4.6    139                 7.9
# ℹ 68,457 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

# import chartevents and d_items data
chartevents_filter <- arrow::open_dataset("chartevents_pq", format = 'parquet') |> 
  select(subject_id, itemid, stay_id, charttime, storetime, valuenum) |> 
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) |>
  collect() |>
  as_tibble() |>
  print(width = Inf)
# A tibble: 22,504,119 × 6
   subject_id itemid  stay_id charttime           storetime           valuenum
        <int>  <int>    <int> <dttm>              <dttm>                 <dbl>
 1   10001884 220179 37510196 2131-01-12 12:01:00 2131-01-12 12:42:00      149
 2   10001884 220180 37510196 2131-01-12 12:01:00 2131-01-12 12:42:00       97
 3   10001884 220045 37510196 2131-01-12 13:00:00 2131-01-12 13:21:00       78
 4   10001884 220210 37510196 2131-01-12 13:00:00 2131-01-12 13:21:00       18
 5   10001884 220179 37510196 2131-01-12 13:01:00 2131-01-12 13:21:00      145
 6   10001884 220180 37510196 2131-01-12 13:01:00 2131-01-12 13:21:00       80
 7   10001884 220045 37510196 2131-01-12 14:00:00 2131-01-12 14:14:00       82
 8   10001884 220210 37510196 2131-01-12 14:00:00 2131-01-12 14:14:00       28
 9   10001884 220179 37510196 2131-01-12 14:01:00 2131-01-12 14:14:00      144
10   10001884 220180 37510196 2131-01-12 14:01:00 2131-01-12 14:14:00       83
# ℹ 22,504,109 more rows
d_items <- read_csv("~/mimic/icu/d_items.csv.gz") |> 
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) |> 
  print(width = Inf)
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 9
  itemid label                                 abbreviation  linksto    
   <dbl> <chr>                                 <chr>         <chr>      
1 220045 Heart Rate                            HR            chartevents
2 220179 Non Invasive Blood Pressure systolic  NBPs          chartevents
3 220180 Non Invasive Blood Pressure diastolic NBPd          chartevents
4 220210 Respiratory Rate                      RR            chartevents
5 223761 Temperature Fahrenheit                Temperature F chartevents
  category            unitname param_type lownormalvalue highnormalvalue
  <chr>               <chr>    <chr>               <dbl>           <dbl>
1 Routine Vital Signs bpm      Numeric                NA              NA
2 Routine Vital Signs mmHg     Numeric                NA              NA
3 Routine Vital Signs mmHg     Numeric                NA              NA
4 Respiratory         insp/min Numeric                NA              NA
5 Routine Vital Signs °F       Numeric                NA              NA
# retrieve a subset of `chartevents.csv.gz` only containing these items for the patients in `icustays_tble`, and further restrict to the first vital measurement within the ICU stay.
chartevents_icu <- chartevents_filter |>
  left_join(
    select(icustays_tble, subject_id, stay_id, intime, outtime), 
    by = c("subject_id", "stay_id")) |>
  group_by(subject_id, stay_id, itemid) |>
  filter(storetime > intime & storetime < outtime) |>
  filter(storetime == min(storetime)) |>
  ungroup() |>
  collect() |>
  as_tibble() |>
  print(width = Inf)
# A tibble: 496,140 × 8
   subject_id itemid  stay_id charttime           storetime           valuenum
        <dbl>  <int>    <dbl> <dttm>              <dttm>                 <dbl>
 1   10001884 220210 37510196 2131-01-10 20:20:00 2131-01-10 21:50:00       16
 2   10001884 220045 37510196 2131-01-10 20:22:00 2131-01-10 21:50:00       38
 3   10001884 220179 37510196 2131-01-10 20:25:00 2131-01-10 21:50:00      180
 4   10001884 220180 37510196 2131-01-10 20:25:00 2131-01-10 21:50:00       12
 5   10001884 220045 37510196 2131-01-10 21:00:00 2131-01-10 21:50:00       60
 6   10001884 220210 37510196 2131-01-10 21:00:00 2131-01-10 21:50:00       10
 7   10001884 220179 37510196 2131-01-10 21:01:00 2131-01-10 21:50:00      167
 8   10001884 220180 37510196 2131-01-10 21:01:00 2131-01-10 21:50:00       49
 9   10002013 220179 39060235 2160-05-19 00:53:00 2160-05-19 01:29:00      104
10   10002013 220180 39060235 2160-05-19 00:53:00 2160-05-19 01:29:00       70
   intime              outtime            
   <dttm>              <dttm>             
 1 2131-01-11 04:20:05 2131-01-20 08:27:30
 2 2131-01-11 04:20:05 2131-01-20 08:27:30
 3 2131-01-11 04:20:05 2131-01-20 08:27:30
 4 2131-01-11 04:20:05 2131-01-20 08:27:30
 5 2131-01-11 04:20:05 2131-01-20 08:27:30
 6 2131-01-11 04:20:05 2131-01-20 08:27:30
 7 2131-01-11 04:20:05 2131-01-20 08:27:30
 8 2131-01-11 04:20:05 2131-01-20 08:27:30
 9 2160-05-18 10:00:53 2160-05-19 17:33:33
10 2160-05-18 10:00:53 2160-05-19 17:33:33
# ℹ 496,130 more rows
# merge chartevents and d_items
chartevents_item <- chartevents_icu |>
  left_join(
    select(d_items, itemid, label), 
    by = "itemid") |>
  print(width = Inf)
# A tibble: 496,140 × 9
   subject_id itemid  stay_id charttime           storetime           valuenum
        <dbl>  <dbl>    <dbl> <dttm>              <dttm>                 <dbl>
 1   10001884 220210 37510196 2131-01-10 20:20:00 2131-01-10 21:50:00       16
 2   10001884 220045 37510196 2131-01-10 20:22:00 2131-01-10 21:50:00       38
 3   10001884 220179 37510196 2131-01-10 20:25:00 2131-01-10 21:50:00      180
 4   10001884 220180 37510196 2131-01-10 20:25:00 2131-01-10 21:50:00       12
 5   10001884 220045 37510196 2131-01-10 21:00:00 2131-01-10 21:50:00       60
 6   10001884 220210 37510196 2131-01-10 21:00:00 2131-01-10 21:50:00       10
 7   10001884 220179 37510196 2131-01-10 21:01:00 2131-01-10 21:50:00      167
 8   10001884 220180 37510196 2131-01-10 21:01:00 2131-01-10 21:50:00       49
 9   10002013 220179 39060235 2160-05-19 00:53:00 2160-05-19 01:29:00      104
10   10002013 220180 39060235 2160-05-19 00:53:00 2160-05-19 01:29:00       70
   intime              outtime             label                                
   <dttm>              <dttm>              <chr>                                
 1 2131-01-11 04:20:05 2131-01-20 08:27:30 Respiratory Rate                     
 2 2131-01-11 04:20:05 2131-01-20 08:27:30 Heart Rate                           
 3 2131-01-11 04:20:05 2131-01-20 08:27:30 Non Invasive Blood Pressure systolic 
 4 2131-01-11 04:20:05 2131-01-20 08:27:30 Non Invasive Blood Pressure diastolic
 5 2131-01-11 04:20:05 2131-01-20 08:27:30 Heart Rate                           
 6 2131-01-11 04:20:05 2131-01-20 08:27:30 Respiratory Rate                     
 7 2131-01-11 04:20:05 2131-01-20 08:27:30 Non Invasive Blood Pressure systolic 
 8 2131-01-11 04:20:05 2131-01-20 08:27:30 Non Invasive Blood Pressure diastolic
 9 2160-05-18 10:00:53 2160-05-19 17:33:33 Non Invasive Blood Pressure systolic 
10 2160-05-18 10:00:53 2160-05-19 17:33:33 Non Invasive Blood Pressure diastolic
# ℹ 496,130 more rows
# The final `chartevents_tble` should have one row per ICU stay and columns for each vital measurement so we convert the table from long to wide. If multiple values of the same item with the same storetime, use the one with the latest charttime. If multiple values of the same item with the same storetime and charttime, use the last valuenum.
chartevents_tble <- chartevents_item |>
  group_by(subject_id, stay_id, storetime, label) |> 
  filter(charttime == max(charttime)) |>
  ungroup() |>
  group_by(subject_id, stay_id, label, charttime) |>
  # retain the last row
  slice_tail() |>
  ungroup() |>
  select(subject_id, stay_id, label, valuenum) |>
  pivot_wider(names_from = label, values_from = valuenum) |>
  print(width = Inf)
# A tibble: 73,111 × 7
   subject_id  stay_id `Heart Rate` `Non Invasive Blood Pressure diastolic`
        <dbl>    <dbl>        <dbl>                                   <dbl>
 1   10000032 39553978           91                                      48
 2   10000980 39765666           75                                     127
 3   10001217 34592300           69                                      88
 4   10001217 37067082           86                                      90
 5   10001725 31205490           86                                      56
 6   10001884 37510196           60                                      49
 7   10002013 39060235           80                                      54
 8   10002155 31090461           91                                      51
 9   10002155 32358465          114                                      65
10   10002155 33685454           71                                      61
   `Non Invasive Blood Pressure systolic` `Respiratory Rate`
                                    <dbl>              <dbl>
 1                                     84                 24
 2                                    158                 24
 3                                    145                 14
 4                                    151                 18
 5                                     73                 19
 6                                    167                 10
 7                                     93                 14
 8                                    118                 18
 9                                    109                 24
10                                    126                 17
   `Temperature Fahrenheit`
                      <dbl>
 1                     98.7
 2                     98  
 3                     97.6
 4                     98.5
 5                     97.7
 6                     98.1
 7                     97.2
 8                     96.9
 9                     97.7
10                     95.9
# ℹ 73,101 more rows

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer:

# filter icustays_tble for adults
icustays_adults <- icustays_tble |>
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) |>
  left_join(patients_tble, by = "subject_id") |>
  mutate(
    age_intime = anchor_age + as.double(year(intime)) - anchor_year) |>
  filter(age_intime >= 18) |>
  print(width = Inf)
# A tibble: 73,181 × 28
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2189-06-27 20:38:27 0.498 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 2157-11-21 22:08:00 1.12  2157-11-18 22:56:00 2157-11-25 18:00:00
 4 2157-12-20 14:27:41 0.948 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 2110-04-12 23:59:56 1.34  2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17  2131-01-07 20:39:00 2131-01-20 05:15:00
 7 2160-05-19 17:33:33 1.31  2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10 18:09:21 0.859 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2129-08-10 17:02:38 6.18  2129-08-04 12:44:00 2129-08-18 16:53:00
10 2130-09-27 22:13:41 3.89  2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    0 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        age_intime
   <date>          <dbl>
 1 2180-09-09         52
 2 2193-08-26         76
 3 NA                 55
 4 NA                 55
 5 NA                 46
 6 2131-01-20         77
 7 NA                 57
 8 2131-03-10         83
 9 2131-03-10         81
10 2131-03-10         82
# ℹ 73,171 more rows
# merge icustays_adults, labevents_tble and chartevents_tble
mimic_icu_cohort <- icustays_adults |>
  left_join(labevents_tble, by = c("subject_id", "stay_id")) |>
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) |>
  print(width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2189-06-27 20:38:27 0.498 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 2157-11-21 22:08:00 1.12  2157-11-18 22:56:00 2157-11-25 18:00:00
 4 2157-12-20 14:27:41 0.948 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 2110-04-12 23:59:56 1.34  2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17  2131-01-07 20:39:00 2131-01-20 05:15:00
 7 2160-05-19 17:33:33 1.31  2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10 18:09:21 0.859 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2129-08-10 17:02:38 6.18  2129-08-04 12:44:00 2129-08-18 16:53:00
10 2130-09-27 22:13:41 3.89  2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    0 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        age_intime Bicarbonate Chloride Creatinine Glucose Hematocrit
   <date>          <dbl>       <dbl>    <dbl>      <dbl>   <dbl>      <dbl>
 1 2180-09-09         52          25       95        0.7     102       41.1
 2 2193-08-26         76          21      109        2.3      89       27.3
 3 NA                 55          22      108        0.6     112       38.1
 4 NA                 55          30      104        0.5      87       37.4
 5 NA                 46          NA       98       NA        NA       NA  
 6 2131-01-20         77          30       88        1.1     141       39.7
 7 NA                 57          24      102        0.9     288       34.9
 8 2131-03-10         83          26       85        1.4     133       22.4
 9 2131-03-10         81          24      105        1.1     138       39.7
10 2131-03-10         82          23       98        2.8     117       25.5
   Potassium Sodium `White Blood Cells` `Heart Rate`
       <dbl>  <dbl>               <dbl>        <dbl>
 1       6.7    126                 6.9           91
 2       3.9    144                 5.3           75
 3       4.2    142                15.7           86
 4       4.1    142                 5.4           69
 5       4.1    139                NA             86
 6       4.5    130                12.2           60
 7       3.5    137                 7.2           80
 8       5.7    120                 9.8          114
 9       4.6    139                 7.9           71
10       4.9    135                17.9           91
   `Non Invasive Blood Pressure diastolic`
                                     <dbl>
 1                                      48
 2                                     127
 3                                      90
 4                                      88
 5                                      56
 6                                      49
 7                                      54
 8                                      65
 9                                      61
10                                      51
   `Non Invasive Blood Pressure systolic` `Respiratory Rate`
                                    <dbl>              <dbl>
 1                                     84                 24
 2                                    158                 24
 3                                    151                 18
 4                                    145                 14
 5                                     73                 19
 6                                    167                 10
 7                                     93                 14
 8                                    109                 24
 9                                    126                 17
10                                    118                 18
   `Temperature Fahrenheit`
                      <dbl>
 1                     98.7
 2                     98  
 3                     98.5
 4                     97.6
 5                     97.7
 6                     98.1
 7                     97.2
 8                     97.7
 9                     95.9
10                     96.9
# ℹ 73,171 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the first available vital measurements within the ICU stay

  • Length of ICU stay los vs first ICU unit

Answer:

# get the summary of the cohort
summary(mimic_icu_cohort)
   subject_id          hadm_id            stay_id         first_careunit    
 Min.   :10000032   Min.   :20000094   Min.   :30000153   Length:73181      
 1st Qu.:12491257   1st Qu.:22483349   1st Qu.:32489353   Class :character  
 Median :14998941   Median :24969636   Median :34993892   Mode  :character  
 Mean   :14998189   Mean   :24981328   Mean   :34992735                     
 3rd Qu.:17513267   3rd Qu.:27471796   3rd Qu.:37488401                     
 Max.   :19999987   Max.   :29999828   Max.   :39999810                     
                                                                            
 last_careunit          intime                      
 Length:73181       Min.   :2110-01-11 10:16:06.00  
 Class :character   1st Qu.:2133-11-09 22:25:00.00  
 Mode  :character   Median :2153-09-23 21:24:00.00  
                    Mean   :2153-11-14 08:26:11.48  
                    3rd Qu.:2174-02-03 15:48:30.00  
                    Max.   :2211-11-09 03:33:00.00  
                                                    
    outtime                            los           
 Min.   :2110-01-12 17:17:47.00   Min.   :  0.00125  
 1st Qu.:2133-11-12 15:55:48.00   1st Qu.:  1.08422  
 Median :2153-09-25 21:45:39.00   Median :  1.92678  
 Mean   :2153-11-17 19:16:58.31   Mean   :  3.45193  
 3rd Qu.:2174-02-06 19:15:52.00   3rd Qu.:  3.71332  
 Max.   :2211-11-11 18:34:18.00   Max.   :110.23228  
                                                     
   admittime                        dischtime                     
 Min.   :2110-01-11 10:14:00.00   Min.   :2110-01-15 17:31:00.00  
 1st Qu.:2133-11-09 18:39:00.00   1st Qu.:2133-11-16 16:03:00.00  
 Median :2153-09-23 06:26:00.00   Median :2153-10-01 19:26:00.00  
 Mean   :2153-11-12 06:55:34.26   Mean   :2153-11-23 06:05:30.06  
 3rd Qu.:2174-01-31 11:23:00.00   3rd Qu.:2174-02-12 19:43:00.00  
 Max.   :2211-11-09 02:26:00.00   Max.   :2211-11-11 18:15:00.00  
                                                                  
   deathtime                      admission_type     admit_provider_id 
 Min.   :2110-01-25 09:40:00.00   Length:73181       Length:73181      
 1st Qu.:2133-07-16 13:30:00.00   Class :character   Class :character  
 Median :2153-06-06 23:26:00.00   Mode  :character   Mode  :character  
 Mean   :2153-08-24 05:26:31.77                                        
 3rd Qu.:2173-10-30 20:38:00.00                                        
 Max.   :2211-01-17 12:34:00.00                                        
 NA's   :64860                                                         
 admission_location discharge_location  insurance           language        
 Length:73181       Length:73181       Length:73181       Length:73181      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 marital_status         race             edregtime                     
 Length:73181       Length:73181       Min.   :2110-01-11 21:42:00.00  
 Class :character   Class :character   1st Qu.:2134-12-30 12:00:30.00  
 Mode  :character   Mode  :character   Median :2154-08-09 10:54:00.00  
                                       Mean   :2154-10-19 16:19:52.48  
                                       3rd Qu.:2174-12-23 17:58:00.00  
                                       Max.   :2211-11-08 18:57:00.00  
                                       NA's   :24898                   
   edouttime                      hospital_expire_flag    gender         
 Min.   :2110-01-12 00:54:00.00   Min.   :0.0000       Length:73181      
 1st Qu.:2134-12-30 16:18:00.00   1st Qu.:0.0000       Class :character  
 Median :2154-08-09 11:12:00.00   Median :0.0000       Mode  :character  
 Mean   :2154-10-19 22:24:59.31   Mean   :0.1138                         
 3rd Qu.:2174-12-23 22:15:00.00   3rd Qu.:0.0000                         
 Max.   :2211-11-09 03:33:00.00   Max.   :1.0000                         
 NA's   :24898                                                           
   anchor_age    anchor_year   anchor_year_group       dod            
 Min.   :18.0   Min.   :2110   Length:73181       Min.   :2110-01-25  
 1st Qu.:53.0   1st Qu.:2132   Class :character   1st Qu.:2135-07-20  
 Median :65.0   Median :2152   Mode  :character   Median :2155-06-23  
 Mean   :63.2   Mean   :2152                      Mean   :2155-09-12  
 3rd Qu.:76.0   3rd Qu.:2172                      3rd Qu.:2175-12-18  
 Max.   :91.0   Max.   :2207                      Max.   :2211-01-17  
                                                  NA's   :44907       
   age_intime      Bicarbonate       Chloride       Creatinine    
 Min.   : 18.00   Min.   : 2.00   Min.   : 45.0   Min.   : 0.000  
 1st Qu.: 54.00   1st Qu.:22.00   1st Qu.: 98.0   1st Qu.: 0.800  
 Median : 66.00   Median :25.00   Median :102.0   Median : 1.000  
 Mean   : 64.68   Mean   :24.34   Mean   :101.2   Mean   : 1.494  
 3rd Qu.: 77.00   3rd Qu.:27.00   3rd Qu.:105.0   3rd Qu.: 1.500  
 Max.   :102.00   Max.   :50.00   Max.   :144.0   Max.   :43.000  
                  NA's   :9050    NA's   :8883    NA's   :5770    
    Glucose       Hematocrit      Potassium          Sodium    
 Min.   :   4   Min.   : 6.50   Min.   : 1.300   Min.   : 74   
 1st Qu.:  99   1st Qu.:29.70   1st Qu.: 3.800   1st Qu.:136   
 Median : 120   Median :35.20   Median : 4.200   Median :139   
 Mean   : 144   Mean   :34.84   Mean   : 4.317   Mean   :138   
 3rd Qu.: 156   3rd Qu.:40.00   3rd Qu.: 4.600   3rd Qu.:141   
 Max.   :2340   Max.   :69.70   Max.   :10.000   Max.   :180   
 NA's   :9099   NA's   :5017    NA's   :8901     NA's   :8872  
 White Blood Cells   Heart Rate     Non Invasive Blood Pressure diastolic
 Min.   :  0.10    Min.   :  0.00   Min.   :   0.00                      
 1st Qu.:  6.80    1st Qu.: 74.00   1st Qu.:  56.00                      
 Median :  9.30    Median : 85.00   Median :  66.00                      
 Mean   : 11.06    Mean   : 87.88   Mean   :  68.72                      
 3rd Qu.: 13.20    3rd Qu.:100.00   3rd Qu.:  79.00                      
 Max.   :479.20    Max.   :941.00   Max.   :7793.00                      
 NA's   :5094      NA's   :74       NA's   :996                          
 Non Invasive Blood Pressure systolic Respiratory Rate Temperature Fahrenheit
 Min.   :     0.0                     Min.   :  0.0    Min.   :  0.00        
 1st Qu.:   105.0                     1st Qu.: 15.0    1st Qu.: 97.60        
 Median :   120.0                     Median : 18.0    Median : 98.20        
 Mean   :   124.1                     Mean   : 19.2    Mean   : 98.06        
 3rd Qu.:   138.0                     3rd Qu.: 22.0    3rd Qu.: 98.70        
 Max.   :116114.0                     Max.   :180.0    Max.   :105.80        
 NA's   :993                          NA's   :148      NA's   :1301          
# draw the distribution of los
ggplot(data = mimic_icu_cohort) + 
  geom_histogram(mapping = aes(x = los)) +
  labs(title = "Distribution of Length of ICU Stay", 
       x = "Length of ICU Stay (los)", y = "Frequency") +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# devide los into 6 groups
mimic_icu_cohort <- mimic_icu_cohort |>
  mutate(los_group = cut(los, breaks = c(0, 2.5, 5, 10, 15, 20, 115)))

Q8.1 Length of ICU stay los vs demographic variables

# Length of ICU stay `los` vs demographic variables
# Visualize the distribution of los for each demographic variable

# Define demographic variables
demo_vars <- c("race", "insurance", "marital_status", "gender")

# Create an empty list to store plots
plots_demo <- list()

# Iterate over each demographic variable
for (var in demo_vars) {
  p <- ggplot(mimic_icu_cohort) +
    geom_bar(mapping = aes(x = !!sym(var), fill = los_group)) +
    labs(title = paste("Length of ICU Stay vs ", var)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          axis.title = element_text(size = 8))
  
  # Add the plot to the list
  plots_demo[[var]] <- p
}
# print the plots
plots_demo
$race


$insurance


$marital_status


$gender

# plot for age at intime vs los with loess smoothed line
# devide age into 5 groups
mimic_icu_cohort <- mimic_icu_cohort |>
  mutate(
    age_group = cut(age_intime, 
                    breaks = c(17, 30, 40, 50, 60, 70, 80, 90, 103)))

# plot the bar plot for each group
ggplot(data = mimic_icu_cohort) + 
  geom_bar(mapping = aes(x = age_group, fill = los_group)) +
  labs(title = "Distributions of age at intime") +
  labs(x = "Age at Intime (years)", fill = "Length of ICU Stay (los)") + 
  theme_minimal()

max(mimic_icu_cohort$age_intime)
[1] 102

Q8.2 Length of ICU stay los vs the last available lab measurements before ICU stay

# Length of ICU stay `los` vs the last available lab measurements before ICU stay
# Visualize the distribution of los for each lab measurement

# Define lab measurement variables
lab_vars <- c("Creatinine", "Potassium", "Sodium", "Chloride", "Bicarbonate", "Hematocrit", "Glucose", "White Blood Cells")

# Create an empty list to store plots
plots_lab <- list()

# Iterate over each demographic variable
for (var in lab_vars) {
  p <- ggplot(mimic_icu_cohort, aes(x = !!sym(var), y = los)) +
   # geom_point() +
    geom_smooth() +
    labs(x = var) +
    theme(axis.title = element_text(size = 12), 
        legend.text = element_text(size = 8))
    theme_minimal()
  
  # Add the plot to the list
  plots_lab[[var]] <- p
}
# print the plots
grid.arrange(grobs = plots_lab, nrow = 3, 
  top = "Length of ICU Stay vs Last Available Lab Measurements")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 5770 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 8901 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 8872 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 8883 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 9050 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 5017 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 9099 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 5094 rows containing non-finite values (`stat_smooth()`).

Q8.3 Length of ICU stay los vs the first vital measurements within the ICU stay

# Length of ICU stay `los` vs the first vital measurements within the ICU stay
# Visualize the distribution of los for each vital measurement

# Define vital measurement variables
vital_vars <- c("Heart Rate", "Non Invasive Blood Pressure systolic", 
                "Non Invasive Blood Pressure diastolic", 
                "Temperature Fahrenheit", "Respiratory Rate")

# Create an empty list to store plots
plots_vital <- list()

# Iterate over each demographic variable
for (var in vital_vars) {
  p <- ggplot(mimic_icu_cohort, aes(x = !!sym(var), y = los)) +
    geom_smooth() +
    labs(x = var) +
    theme_minimal()
  
  # Add the plot to the list
  plots_vital[[var]] <- p
}
# print the plots
grid.arrange(grobs = plots_vital, nrow = 3, 
  top = "Length of ICU Stay vs First Available Vital Measurements")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 74 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 993 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 996 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 1301 rows containing non-finite values (`stat_smooth()`).
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 148 rows containing non-finite values (`stat_smooth()`).

Q8.4 Length of ICU stay los vs first ICU unit

# Length of ICU stay `los` vs first ICU unit
# Visualize the distribution of los for each first ICU unit

# plot the bar plot for first ICU unit
ggplot(data = mimic_icu_cohort) + 
  geom_bar(mapping = aes(x = first_careunit, fill = los_group)) +
  labs(title = "Distributions of first ICU unit") +
  labs(x = "First ICU Unit", fill = "Length of ICU Stay (los)") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))